2 Part II Advanced displays

2.1 Preparation

You will find the material for this lecture at https://github.com/cbhurley/CRT2021vis

First install these packages

install.packages(c("GGally", "ISLR2", "naniar"))

and load with

library(GGally)
library(ISLR2)
library(timetk) # for the data
library(palmerpenguins)
library(tidyverse)
library(naniar)

bike <- bike_sharing_daily
bike$season <- c("Winter","Spring", "Summer","Fall")[bike$season]
bike$season <- factor(bike$season, levels = c("Winter","Spring", "Summer","Fall"))  

bike$mnth <- factor(bike$mnth)

bike$holiday <- factor(c("No","Yes")[bike$holiday+1])
bike$workingday <- factor(c("No","Yes")[bike$workingday+1])

bike$yr <- factor(c("2011","2012")[bike$yr+1])

bike$weathersit <- c("clear", "cloudy", "lightP", "heavyP")[bike$weathersit]
bike$weathersit <- factor(bike$weathersit, levels= c("clear", "cloudy", "lightP", "heavyP"))
bike <- droplevels(bike)

2.2 Big data

ISLR2 has an hourly version of the bike data with 8645 observations.

The plot of registered versuscasual users is

ggplot(data=Bikeshare, aes(x=casual, y=registered)) + geom_point()

There is likely lot of overplotting.

Jittering will randomly undo rounding, and using alpha will help too

ggplot(data=Bikeshare, aes(x=casual, y=registered)) + 
  geom_jitter(alpha=.5, size=.5)

Other options are to form 2d bins, count the number of observations in each, and color in proportion to the frequency.

p <- ggplot(data=Bikeshare, aes(x=casual, y=registered))
p+ geom_bin2d()

I prefer to make my colour scale light to dark:

p +  
  geom_bin2d() +
  scale_fill_gradient(low = "lightblue1", high = "steelblue4")

p +  
  geom_bin2d() +
  scale_fill_gradient(low = "lightblue1", high = "steelblue4" ,trans="log10")

The second plot takes a log of the counts so you get a better spread of colours

p +  
  stat_binhex() +
  scale_fill_gradient(low = "lightblue1", high = "steelblue4" ,trans="log10")

stat_binhex uses hexagonal binning

A more advanced plot fits a 2d density estimate and draw this as a contour plot.

p + stat_density2d()

# fills in the contours
p + stat_density2d(aes(fill = after_stat(level)), geom = "polygon") 

2.3 Scatterplot matrix

Back to the smaller dataset.

Go to articles tab for info https://ggobi.github.io/ggally/

library(GGally)
ggpairs(bike, columns=c("casual","registered","cnt" ))

You can change what is plotted on lower, upper and diagonal,using lower= etc. Putting in mapping= gives densities for both years separately

ggpairs(bike, mapping = aes(color = yr),
            columns=c("casual","registered","cnt" ),
  lower = list(continuous =  wrap("smooth", method="lm", se=F, alpha=.5)),
  diag = list(continuous = wrap("densityDiag", alpha=0.5 )))

ggpairs can handle factors as well. In this plot yrs is in as a plot variable, and as colour.

ggpairs(bike, mapping = aes(color = yr),
        columns=c("casual","registered","cnt","yr"),
        lower = list(continuous =  wrap("smooth", method="lm", se=F, alpha=.5)),
        diag = list(continuous = wrap("densityDiag", alpha=0.5 )))

2.4 Parallel coordinates

Construction:

s <- sample(nrow(bike),20)
ggparcoord(bike[s,], columns=match(c("casual","registered","cnt"),names(bike)),
           groupColumn = "yr",
           showPoints=TRUE,
           alphaLines=0)

ggparcoord(bike[s,], columns=match(c("casual","registered","cnt"),names(bike)),
           groupColumn = "yr",
           showPoints=TRUE,
           alphaLines=1)

ggparcoord(bike[s,], columns=match(c("casual","registered","cnt"),names(bike)),
           groupColumn = "yr")

By default, all variables are scaled to mean zero and standard deviation 1, individually. Variables that are positively correlated will have parallel line segments, here registered and cnt.

Other scaling options are:

ggparcoord(bike[s,], columns=match(c("casual","registered","cnt"),names(bike)),
           groupColumn = "yr", scale="uniminmax")

ggparcoord(bike[s,], columns=match(c("casual","registered","cnt"),names(bike)),
           groupColumn = "yr", scale="globalminmax")

In globalminmax, no scaling is done. Here you can see that most users per day are registered.

To see what certain patterns look like in a parallel coordinate plot, let us look at some fake data

x <- rnorm(100)
y <- x+ .3*rnorm(100)
z <- -y+ .1*rnorm(100)
w <- -z+ .5*rnorm(100)
ggparcoord(data.frame(x,y,z,w), scale="uniminmax")

Positive correlation exhibits as parallel segments.

Negative correlation has crossing. The stronger the negative correlation the smaller the pinch point.

ggparcoord(bike, columns=match(c("casual","registered","cnt"),names(bike)),
           groupColumn = "workingday",
           alphaLines=.4)

In the above we see that there is a difference in the casual-registered association for workingday or not. Both groups have positive association but slopes are different. There are more casual users at the weekend.

Verifying with a scatterplot

ggplot(data=bike, aes(x=casual, y=registered, color=workingday)) + geom_point()

The same pattern was evident in the full bike data.

2.5 Practice 3

Using the penguins dataset, make this plot

Which pair of variables and species has the highest correlation?

Can you find two variables with negative correlation, but positive correlation within each species? This is called Simpson’s paradox.

Using ggparcoord, plot all the numeric variables. Use species as color.

Can you identify presence of positive and negative correlation?

2.6 Missing data

Here we use package naniar. This is the dataset for your assignment. All the NAs are marked in grey.

m <- read_csv("marathon.csv",na=c(""," ","NA"))
vis_miss(m)

An upset plot shows the combination of missing values. Most of the NAs involve the club variable.

gg_miss_upset(m[,-c(22,20)])

Check out the visualisations in naniar for more options, eg the vignette at https://cran.r-project.org/web/packages/naniar/vignettes/naniar-visualisation.html

2.7 Titanic case study

There were 2201 people aboard the Titanic when it sank in 1912. A summary of the people on board and whether they survived or not is given in the Titanic dataset.

Here are some images as it left its last port of call, Cobh.

Can we figure out from this data which category of passenger had the highest chance of survival?

This case study looks at 4 categorical variables simultaneously: - Survival (yes or no), - class (first, second, third, crew,) - gender (male or female) and - age (child or adult).

First we look at the number of survivors by class

#Titanic  # a 4d array....
tit <- as.data.frame(Titanic)
head(tit)
##   Class    Sex   Age Survived Freq
## 1   1st   Male Child       No    0
## 2   2nd   Male Child       No    0
## 3   3rd   Male Child       No   35
## 4  Crew   Male Child       No    0
## 5   1st Female Child       No    0
## 6   2nd Female Child       No    0
tit<- tit[c(17:32,1:16),]

ggplot(data = tit, aes(x = Class, y = Freq,fill=Survived)) +
  geom_col() 

and then the survival rates by class

ggplot(data = tit, aes(x = Class, y = Freq,fill=Survived)) +
  geom_col(position="fill") 

  • There is a big difference in Survival rates across the classes.
  • First class passengers have the highest chance of survival, crew members the lowest.

We can bring in gender with facet_wrap

ggplot(data = tit, aes(x = Class, y = Freq,fill=Survived)) +
  geom_col(position="fill") + 
  facet_wrap( ~ Sex) + ylab("Rate")

  • Third class females do worse than crew females (46% versus 87%), and third class males do worse than crew males (17% versus 22%), although when gender is ignored the situation is reversed.

  • This apparent paradox (Simpson’s paradox) occurs because the vast majority of crew members are male.

  • The overall 3rd class survival rate of 25% is an average of 46% and 17%, but as there are 2.6 times more men than women in 3rd class 17% gets more weight.

  • The overall crew survival rate of 24% is an average of 87% and 22%, but as there are 37 times more men than women in crew class 87% gets very little weight.

Finally, bringing in age

ggplot(data = tit, aes(x = Class, y = Freq,fill=Survived)) +
  geom_col(position="fill") + 
  facet_wrap( ~ Age+ Sex) + ylab("Rate")

We can get some simplification by showing the survival rate, that is omit the red sections.

titR <- 
tit %>%  group_by(Class, Sex,Age) %>%
summarise(died=Freq[2], survived=Freq[1], rate=survived/sum(Freq))
titR
## # A tibble: 16 × 6
## # Groups:   Class, Sex [8]
##    Class Sex    Age    died survived     rate
##    <fct> <fct>  <fct> <dbl>    <dbl>    <dbl>
##  1 1st   Male   Child     0        5   1     
##  2 1st   Male   Adult   118       57   0.326 
##  3 1st   Female Child     0        1   1     
##  4 1st   Female Adult     4      140   0.972 
##  5 2nd   Male   Child     0       11   1     
##  6 2nd   Male   Adult   154       14   0.0833
##  7 2nd   Female Child     0       13   1     
##  8 2nd   Female Adult    13       80   0.860 
##  9 3rd   Male   Child    35       13   0.271 
## 10 3rd   Male   Adult   387       75   0.162 
## 11 3rd   Female Child    17       14   0.452 
## 12 3rd   Female Adult    89       76   0.461 
## 13 Crew  Male   Child     0        0 NaN     
## 14 Crew  Male   Adult   670      192   0.223 
## 15 Crew  Female Child     0        0 NaN     
## 16 Crew  Female Adult     3       20   0.870
ggplot(data = titR, aes(x = Class, y = rate)) +
  geom_col(fill="lightblue") + 
  facet_wrap( ~ Age+ Sex) 

2.8 Practice 4

Explore the survival rates for the data lusitania.csv on https://github.com/cbhurley/CRT2021vis This is a tabulated version of the data, in the same format as tit. The complete dataset is available on Kaggle.

You can find out more about the shipwreck https://en.wikipedia.org/wiki/RMS_Lusitania


---
title: "Data visualisation in R"
author: "Catherine Hurley"
date: "7/9/2021"
output:
  html_document:
    theme: readable
    code_download: true
    toc: true
    toc_depth: 2
    number_sections: true
    pandoc_args: ["--number-offset=1"]
editor_options: 
  chunk_output_type: console
  
---

<!-- <style type="text/css"> -->
<!--   body, td { -->
<!--     font-size: 14pt; -->
<!--   } -->
<!-- code.r{ -->
<!--   font-size: 12pt; -->
<!-- } -->
<!-- pre { -->
<!--   font-size: 12pt -->
<!-- } -->
<!-- </style> -->









```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message=FALSE, warning=F)
```
# Part II Advanced displays

## Preparation

You will find the material for this lecture at <https://github.com/cbhurley/CRT2021vis>

First install these packages

```{r, eval=F}
install.packages(c("GGally", "ISLR2", "naniar"))
```

and load with

```{r}
library(GGally)
library(ISLR2)
library(timetk) # for the data
library(palmerpenguins)
library(tidyverse)
library(naniar)

bike <- bike_sharing_daily
bike$season <- c("Winter","Spring", "Summer","Fall")[bike$season]
bike$season <- factor(bike$season, levels = c("Winter","Spring", "Summer","Fall"))  

bike$mnth <- factor(bike$mnth)

bike$holiday <- factor(c("No","Yes")[bike$holiday+1])
bike$workingday <- factor(c("No","Yes")[bike$workingday+1])

bike$yr <- factor(c("2011","2012")[bike$yr+1])

bike$weathersit <- c("clear", "cloudy", "lightP", "heavyP")[bike$weathersit]
bike$weathersit <- factor(bike$weathersit, levels= c("clear", "cloudy", "lightP", "heavyP"))
bike <- droplevels(bike)
```


## Big data

ISLR2 has an hourly version of the bike data with 8645  observations.

The plot of registered versuscasual users is

```{r,  fig.width=4.5, fig.height=4, fig.align="center"}
ggplot(data=Bikeshare, aes(x=casual, y=registered)) + geom_point()
```

There is likely lot of overplotting.

Jittering will randomly undo rounding, and using alpha will help too

```{r,  fig.width=4.5, fig.height=4, fig.align="center"}
ggplot(data=Bikeshare, aes(x=casual, y=registered)) + 
  geom_jitter(alpha=.5, size=.5)
```

Other options are to form 2d bins, count the number of observations in each, and color in
proportion to the frequency.

```{r,  fig.width=5, fig.height=4, fig.align="center"}
p <- ggplot(data=Bikeshare, aes(x=casual, y=registered))
p+ geom_bin2d()
```

I prefer to make my colour scale light to dark:
```{r,  fig.width=5, fig.height=4, fig.align="center"}
p +  
  geom_bin2d() +
  scale_fill_gradient(low = "lightblue1", high = "steelblue4")

p +  
  geom_bin2d() +
  scale_fill_gradient(low = "lightblue1", high = "steelblue4" ,trans="log10")
```

The second plot takes a log of the counts so you get a better spread of colours

```{r,  fig.width=5, fig.height=4, fig.align="center"}
p +  
  stat_binhex() +
  scale_fill_gradient(low = "lightblue1", high = "steelblue4" ,trans="log10")
```
`stat_binhex` uses hexagonal binning

 A more advanced plot fits a 2d density estimate and draw this as a contour plot.
 
```{r,  fig.width=5, fig.height=4, fig.align="center"} 
p + stat_density2d()
```  
```{r,  fig.width=5, fig.height=4, fig.align="center", eval=F} 
# fills in the contours
p + stat_density2d(aes(fill = after_stat(level)), geom = "polygon") 
```    

## Scatterplot matrix

Back to the smaller dataset.

Go to articles tab for info
<https://ggobi.github.io/ggally/>


```{r, fig.width=4.5, fig.height=4, fig.align="center"}
library(GGally)
ggpairs(bike, columns=c("casual","registered","cnt" ))
```

You can change what is plotted on lower, upper and diagonal,using
`lower=` etc.
Putting in `mapping=` gives densities for both years separately


```{r, fig.width=4.5, fig.height=4, fig.align="center"}

  
ggpairs(bike, mapping = aes(color = yr),
            columns=c("casual","registered","cnt" ),
  lower = list(continuous =  wrap("smooth", method="lm", se=F, alpha=.5)),
  diag = list(continuous = wrap("densityDiag", alpha=0.5 )))

```


`ggpairs` can handle factors as well.
In this plot `yrs` is in as a plot variable, and as colour.

```{r, fig.width=4.5, fig.height=4, fig.align="center"}


ggpairs(bike, mapping = aes(color = yr),
        columns=c("casual","registered","cnt","yr"),
        lower = list(continuous =  wrap("smooth", method="lm", se=F, alpha=.5)),
        diag = list(continuous = wrap("densityDiag", alpha=0.5 )))
 
```

## Parallel coordinates

Construction:

```{r, fig.width=6, fig.height=3, fig.align="center"}

s <- sample(nrow(bike),20)
ggparcoord(bike[s,], columns=match(c("casual","registered","cnt"),names(bike)),
           groupColumn = "yr",
           showPoints=TRUE,
           alphaLines=0)

ggparcoord(bike[s,], columns=match(c("casual","registered","cnt"),names(bike)),
           groupColumn = "yr",
           showPoints=TRUE,
           alphaLines=1)

ggparcoord(bike[s,], columns=match(c("casual","registered","cnt"),names(bike)),
           groupColumn = "yr")
```

By default, all variables are scaled to mean zero and standard deviation 1, individually.
Variables that are positively correlated will have parallel line segments, here registered and cnt.

Other scaling options are:

```{r, fig.width=6, fig.height=3, fig.align="center"}

ggparcoord(bike[s,], columns=match(c("casual","registered","cnt"),names(bike)),
           groupColumn = "yr", scale="uniminmax")
ggparcoord(bike[s,], columns=match(c("casual","registered","cnt"),names(bike)),
           groupColumn = "yr", scale="globalminmax")

```
In globalminmax, no scaling is done. Here you can see that most users per day are registered.

To see what certain patterns look like in a parallel coordinate plot, let
us look at some fake data

```{r, fig.width=6, fig.height=3, fig.align="center"}

x <- rnorm(100)
y <- x+ .3*rnorm(100)
z <- -y+ .1*rnorm(100)
w <- -z+ .5*rnorm(100)
ggparcoord(data.frame(x,y,z,w), scale="uniminmax")
```

Positive correlation exhibits as parallel segments.

Negative correlation has crossing. The stronger the negative correlation the smaller the pinch point.




```{r,fig.width=6, fig.height=3, fig.align="center"}
ggparcoord(bike, columns=match(c("casual","registered","cnt"),names(bike)),
           groupColumn = "workingday",
           alphaLines=.4)
```

In the above we see that there is a difference in the casual-registered association for workingday or not.
Both groups have positive association but slopes are different.
There are more casual users at the weekend.

Verifying with a scatterplot

```{r,  fig.width=5, fig.height=4, fig.align="center"}
ggplot(data=bike, aes(x=casual, y=registered, color=workingday)) + geom_point()
```

The same pattern was evident in the full bike data.


## Practice 3

Using the `penguins` dataset, make this plot

```{r, fig.width=6, fig.height=5, fig.align="center", echo=F}
ggpairs(na.omit(penguins), mapping = aes(color = species),
            columns=c(3:7),
  lower = list(continuous =  wrap("smooth", method="lm", se=F, alpha=.5)),
  diag = list(continuous = wrap("densityDiag", alpha=0.5 )))
```
Which pair of variables and species has the highest correlation?

Can you find two variables with negative correlation, but positive correlation within each species?
This is called Simpson's paradox.

Using `ggparcoord`, plot all the numeric variables. Use species as color.

Can you identify presence of positive and negative correlation?


```{r,fig.width=6, fig.height=3, fig.align="center", echo=F, eval=F}


ggparcoord(penguins, columns=which(sapply(penguins, is.numeric)),
           groupColumn = "species",
           alphaLines=.4)

ggparcoord(penguins, columns=which(sapply(penguins, is.numeric)),
           groupColumn = "island",
           alphaLines=.4)

ggparcoord(penguins, columns=which(sapply(penguins, is.numeric)),
           groupColumn = "sex",
           alphaLines=.4)

```

## Missing data

Here we use package naniar.
This is the dataset for your assignment. All the NAs are marked in grey.

```{r,fig.align="center"}
m <- read_csv("marathon.csv",na=c(""," ","NA"))
vis_miss(m)
```

An upset plot shows the combination of missing values.
Most of the NAs involve the club variable.

```{r,fig.align="center"}
gg_miss_upset(m[,-c(22,20)])
```

Check out the visualisations in `naniar` for more options, eg the vignette at <https://cran.r-project.org/web/packages/naniar/vignettes/naniar-visualisation.html>


## Titanic case study

There were 2201 people aboard the Titanic when it sank in 1912.
A summary of the people on board and whether they survived or not is given in the Titanic dataset.

Here are some images as it left its last port of call, Cobh.

```{r, echo=F, eval=T,  out.width="30%",fig.align='center', fig.show='hold'}
knitr::include_graphics(c("figs/titanic.png"))
knitr::include_graphics(c("figs/cobh.png"))
```


Can we figure out from this data which category of passenger had the highest chance of survival?

This case study looks at 4 categorical variables simultaneously:
- Survival (yes or no), 
- class (first, second, third, crew,)
- gender (male or female) and 
- age (child or adult).


First we look at the number of survivors by class
```{r,  fig.width=4.5, fig.height=3, fig.align="center"}
#Titanic  # a 4d array....
tit <- as.data.frame(Titanic)
head(tit)
tit<- tit[c(17:32,1:16),]

ggplot(data = tit, aes(x = Class, y = Freq,fill=Survived)) +
  geom_col() 
```

and then the survival rates by class

```{r,  fig.width=4.5, fig.height=3, fig.align="center"}
ggplot(data = tit, aes(x = Class, y = Freq,fill=Survived)) +
  geom_col(position="fill") 
```

- There is a big difference in Survival rates across the classes.
- First class passengers have the highest chance of survival, crew members the lowest. 


We can bring in gender with `facet_wrap`

```{r,  fig.width=5.5, fig.height=3, fig.align="center"}
ggplot(data = tit, aes(x = Class, y = Freq,fill=Survived)) +
  geom_col(position="fill") + 
  facet_wrap( ~ Sex) + ylab("Rate")
```

- Third class females do worse than crew females (46% versus 87%), and
third class males do worse than crew males (17% versus 22%), although when gender is ignored the situation is
reversed.

- This apparent paradox (Simpson's paradox) occurs because the vast majority of crew members are male.

- The overall 3rd class survival rate of 25% is an average of 46% and 17%, but as there are 2.6 times
more men than women in 3rd class 17% gets more weight.

- The overall crew survival rate of 24% is an average of 87% and 22\%, but as there are 37 times
more men than women in crew class 87% gets very little weight.

Finally, bringing in age
```{r,  fig.width=5.5, fig.height=4.5, fig.align="center"}
ggplot(data = tit, aes(x = Class, y = Freq,fill=Survived)) +
  geom_col(position="fill") + 
  facet_wrap( ~ Age+ Sex) + ylab("Rate")
```

We can get some simplification by showing the survival rate, that is omit the red sections.


```{r,  fig.width=4.5, fig.height=4.5, fig.align="center"}
titR <- 
tit %>%  group_by(Class, Sex,Age) %>%
summarise(died=Freq[2], survived=Freq[1], rate=survived/sum(Freq))
titR

ggplot(data = titR, aes(x = Class, y = rate)) +
  geom_col(fill="lightblue") + 
  facet_wrap( ~ Age+ Sex) 

```

## Practice 4

Explore the survival rates for the data `lusitania.csv` on <https://github.com/cbhurley/CRT2021vis>
This is a tabulated version of the data, in the same format as `tit`. The complete dataset
is available on Kaggle.

You can find out more about the shipwreck <https://en.wikipedia.org/wiki/RMS_Lusitania>



## Recommended book

"The plot", Jean Hanff Korelitz

```{r, echo=F, eval=T, out.width="25%", fig.align='center'}
knitr::include_graphics("figs/plot.png")
```

```{r, echo=FALSE, eval=FALSE, purl=FALSE}
knitr::purl("lect2.Rmd")

```


